Prosper Loan Exploration by Peida Cai

Introduction

This report explores a dataset containing loan information for approximately 114,000 loans (with 81 variables) in lending company Prosper Marketplace. Prosper is a peer-to-peer lending company that links borrowers with lenders on an online platform for personal, short-termed, unsecured loans from ranging from $2,000 to $35,000. Loans are issued based on credit worthiness of the borrowers, ratings are assigned and “sold” to one or more “investors” willing to lend the money.

Univariate Plots Section

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"

Overview of Prosper lending business: Number of loans listings per quarter

In general, Prosper has been on an upward trend for the loans issued. It gained momentum since inception in 2006 but saw an abrupt drop in business in Q4 2008. From wikipedia, SEC imposed a cease and desist order on Prosper, primarily due to the novel nature of peer to peer lending then, requiring Proper to register with SEC. Prosper relaunched in July 2009, after obtaining SEC registration and business recovered and surpassed previous peak in 2012.

There is a rapid increase in loan listings in from Q2 2013 onwards, with a peak in Q4 2013

Loan listing monthly distribution

Most popular months for listings are January, October and December, least popular months are March, April and May. Most popular day of the week for listings is Tuesday while least popular day of the week is Monday.

Credit scores

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

Upper and lower credit scores of borrowers for each loan are provided. Both these scores are overlaid together on a histogram.

From Prosper’s website, the minimum criteria to borrow is FICO score of 640 as of September 30, 2015. This accounts for the slight negative skew in the distribution.

Prosper’s credit scores have a mean of between 695.1, slighty better than the national average of 690 (according to Google). There are a good spread of loans with excellent credit scores of over 750 while number of loans dropped rapidly below credit scores of 600. This is in line with Prosper’s strategy of focusing on prime and superprime customers.

The logical follow-up question would be “is credit score directly proportional to the returns for an investor?” i.e. Is there a minimum credit rating below which lending is too risky? And would the returns on investment be higher for borrowers with higher credit scores?

Prosper Inhouse rating distribution

The number of loans is plotted with their inhouse loan rating system “AA” being the lowest risk and “HR” being highest risk. We see that C rating is the most common type of loans, and the distribution is positively skewed, meaning there are more loans of better credit ratings and lower risks, corresponding to the FICO score distributions.

Interestingly, the highest rated loans (“AA”) is also the least common type of loan, even lower than HR. This is contrary to Prosper’s strategy of targeting prime and superprime customers. Does this mean that returns on AA-rated loans are lower than HR-rated loans?

Who are the borrowers within each rating groups? Do high-income earners necessarily get lower-rate loans (better-ratings)?

Borrowers income range distribution

Most common income ranges of the borrowers are ($25,000 to $49,999) and ($50,000 to $74,999). Most common stated monthly income is just over $4,000.

When StatedMonthlyIncome was faceted by Prosper Ratings, it was interesting that there were representation of both high and low income borrowers across all ratings. Except for “AA” rating, the rest of the loans have largely similarly shaped distributions, with mode between $4,000 - $5,000 and a long tail, indicating there are high-income earners across all ratings. Distribution for “AA” is a much flatter, particularly across the monthly income range of $4,000 - $10,000.

We shall take a look at Monthly Income vs Ratings in the Bivariate analysis later to see if they have any correlation.

Terms of loan

There are only 3 terms of loan at Prosper, 12, 36 and 60 months. Most loans are 36 months (3 years) long followed by 60 months (5 years). We see that 3-year loans are most popular for all income ranges and across all ratings. However, the relative difference between 3-year and 5-year loan decreases as income increases. Higher income earners are more willing to extend their loan terms, or it could be more of such customers are eligible for longer term loans.

Borrower rate distribution

Modal borrowing rate stands at about 0.32. This seems to be counter-intuitive as we should expect a rather normally distributed borrowing rate from the credit score profiles of the borrowers seen so far. The second peak is at about 0.15.

Borrower rate distribution

When faceted by ratings, the “Borrower Rate” distribution becomes clearer. Most ratings have a somewhat normal distribution with the rates increasing as quality of rating decreases. Only High Risk rated loans have a unique distribution, where most of the loans are at “Borrower Rate” of 0.32.

Loan original amount distribution

Most common loan amount is $3,500, followed by $15,000 and $10,000. Interestingly, we see a dearth of loans between $25,000 and $35,000. This is probably because $35,000 is the maximum loan amount, hence borrowers may want to maximise their credit lines since they are already borrowing near the maximum amount.

Loan categories distribution

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 16965 58308  7433  7189  2395   756  2572 10494   199    85    91   217 
##    12    13    14    15    16    17    18    19    20 
##    59  1996   876  1522   304    52   885   768   771

From the list of variable definitions, we gathered that the top 5 most common categories of loans are (in descending order): “Debt Consolidation” (by far, accounting for over half of all loans), “Other”, “Home Improvement”, “Business” and “Auto”.

Debt to income ratio distribution

From their website, Prosper restricted borrowers’ debt to income ratio to below 0.5 as of September 30, 2015, this is evident from the positively skewed distribution, with mode of about 0.15.

Bank Card Utilization distribution

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.310   0.600   0.561   0.840   5.950    7604

Most borrowers utilised about 100% of their existing revolving bank cards. This is in line with the most common type of loans (Debt consolidation). When faceted by ratings, it become clearer that only from rating “C” onwards, do the borrowers start to overdraft on their bankcards (card utilization rate of over 1.0), but this type of borrowers are the minority, in line with the company’s strategy of prime and superprime borrowers.

Which cycle do borrowers mostly commonly first default in?

Default is typically defined as missed payment for over 120 days, or starting from the 5th cycle, hence we see the dearth between 1 and 5, and a sudden peak from 5th cycle onwards. Nevertheless, there are exceptional cases where the borrowers defaulted in cycle 0 and cycle 1, these could be those borrowers who went into bankruptcy almost immediately after taking a loan from Prosper.

It is also interesting as to why the modal first default cycle is at the 8th cycle. In addition, there is a small peak at the last cycle for 36 months term.

Net principal loss distribution

We see that the bulk of NetPrincipalLoss is less than $4,000, with mode of about $4,000. There is a sharp drop in count beyond $4,000 before stretching to a long tail extending to $25,000.

Estimated return distribution

Estimated return is the difference between estimated effective yield and “Estimated Loss”, assigned at the time of listing creation, applicable for loans after July 2009. Most common return is about 0.12 with a second peak at 0.08.

Univariate Analysis

What is the structure of your dataset?

There are almost 114,000 observations with 81 variables. The project was conducted from the viewpoint of an investor, focusing on the following variables:

  • Number of listings per quarter Prosper has been enjoying an upward trend in their peer-to-peer loans business since inception in 2006, with the exception of the SEC’s incident in late 2009. In particular, Prosper saw a surge in loan listings in late 2013.

  • Popular months for listings Most popular - > January, October and December Least popular - > March, April and May

  • Popular days of week for listings Most popular - > Tuesday Least popular - > Monday

  • Credit scores Prosper focuses on prime and superprime borrowers (with higher credit scores) with an mean of 695.1 and median of 689 (proportionately more borrowers with above median scores.), slightly better than national average of 690.

  • Inhouse Prosper loan ratings Most common loan rating is “C”. While the distribution is positively skewed (more “A”, “B” than “D”, “E”), there are much more High Risk (HR) loans than “AA”.

  • Borrowers’ income range Most common borrowers’ income range is ($25,000 to $49,999), followed closely by ($50,000 to $74,999), most common state monthly income is just over $4,000.

  • Terms of loans Most common loan term is 36 months, by far.

  • Borrower rate A bi-modal distribution with 0.32 and 0.15 as peaks. The peak at 0.32 is contributed mainly by “HR” loans.

Loan original amount - Most common amount borrowed is $3,500. There are indications that borrowers, if eligible, prefers to borrow the maximum amount of $35,000 instead of an amount between $25,000 and $35,000.

  • Loan Categories Most common category of loan, by far, is “Debt consolidation”

  • Debt to income ratio Most borrowers have debt to income ratio of less than 0.5, as part of the pre-requisites to list on Prosper (recently implemented, so there are some borrowers above the 0.5 threshold).

  • Bank Card Utilization rate Most borrowers have a utilization rate of over 0.5, with a mode at 1.0. This is in line with the most common type of loan as “Debt consolidation”. There are also interesting outliers with rates over 1.0 and a maximum rate at 5.95.

  • FirstCycleDefaulted Most common first cycle defaulted is the 8th cycle, with a stand-alone peak at 1st cycle.

  • Net principal loss Most common amount of net principal loss is about $4,000, with the majority of the losses less than $4,000.

  • Estimated return Most common return at 12%

What is/are the main feature(s) of interest in your dataset?

The main features of the dataset, from the investors’ point of view, is the balance between risk and returns (“Estimated Return”, which is net of “Estimated Loss”). Therefore, the focus will be on variables that have tangible effects on “Estimated Return”.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Prosper developed their inhouse ratings for every loan listing since 2009, effectively summarises the relationship between risk and return and segmented the loans into 7 different categories for the ease of the investors. As the entire dataset is separated into current and completed/closed loans, default rates can be measured for each category. Another feature is Loss Given Default (LGD) rates for each rating as well as their estimated losses (Probability of default x LGD).

Did you create any new variables from existing variables in the dataset?

  • Month of listing
  • Day of Week of listing
  • Average credit score was created using the average of Credit Upper and Lower Scores.
  • Default rates was created using loans with completed, chargedoff and default status (i.e. non-current loans).
  • Loss given default rates was created using Net Principal Loss divided by Original Loan Amount.
  • Expected loss was created using Default rates multiply by Loss Given Default rates.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The anomalies in the features and operations to tidy and adjust the data were discussed in detailed under their respective sections.

Bivariate Plots Section

Scatterplot Matrix

##  [1] "BorrowerRate"              "EstimatedReturn"          
##  [3] "ListingCategory..numeric." "EmploymentStatusDuration" 
##  [5] "CurrentCreditLines"        "OpenCreditLines"          
##  [7] "OpenRevolvingAccounts"     "DelinquenciesLast7Years"  
##  [9] "PublicRecordsLast10Years"  "PublicRecordsLast12Months"
## [11] "RevolvingCreditBalance"    "BankcardUtilization"      
## [13] "AvailableBankcardCredit"   "DebtToIncomeRatio"        
## [15] "StatedMonthlyIncome"       "LoanOriginalAmount"       
## [17] "loan_month"

From a subset of the data, “Estimated Return” has a strong correlation with Borrower’s rate and Prosper’s rating and a moderate correlation with credit score.

Corrplot Matrix

Default rates based on inhouse ratings

The default rates were as expected, increasing as the rating of loans deterioriates. However, the loss given default rates were rather similar across all ratings. This means that once a loan defaults, the likelihood of recovery is very low no matter the rating of the loan.

Quarterly default and LGD rates

Prosper seems to be doing well in terms of reducing the default rates, however, the LGD rates have been going up since 2009. Data from Q1 2014 is not a complete set in that the data was collected before the end of the quarter hence the number of loan from that quarter is lower compared with the rest of the quarters.

CreditScoreAvg vs Prosper Ratings

## 
##  Pearson's product-moment correlation
## 
## data:  prosperloan$ProsperRating..numeric. and prosperloan$CreditScoreAvg
## t = 191.27, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.544155 0.553558
## sample estimates:
##       cor 
## 0.5488738

“CreditScoreAvg” and Prosper Rating have a strong correlation of 0.5489. Median “CreditScoreAvg” declines as rating deteriorates, with the exception of “HR” loans. Interesting to note that “Super-Prime” borrowers can be found across all loan ratings, indicating that there are other features determining the Prosper Rating besides CreditScoreAvg.

Estimated Returns vs Borrower rate

## 
##  Pearson's product-moment correlation
## 
## data:  prosperloan$BorrowerRate and prosperloan$EstimatedReturn
## t = 413.73, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8154276 0.8198876
## sample estimates:
##       cor 
## 0.8176699

The interesting observation here is that while the trend between “Estimated Return” and “Borrower Rate” is generally positive, there are streaks of upwards slopes that are almost linear. It would be interesting to conduct a multivariate analysis later with these two variables to observe the underlying reasons for these streaks.

Estimated Loss vs Borrower Rate

## 
##  Pearson's product-moment correlation
## 
## data:  prosperloan$EstimatedLoss and prosperloan$BorrowerRate
## t = 844.11, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9445764 0.9460085
## sample estimates:
##      cor 
## 0.945297

This is the pair of variables that has the highest correlation. However, similar to estimated return and “Borrower Rate” plot above, there are streaks on this plot that have zero gradient (horizontal). We will need to conduct multivariate analysis to figure out what are causing these streaks. I suspect the cause is the same as that for estimated return since return is a function of loss directly.

Estimated Return vs Prosper Ratings

The “Estimated Return” increase as loan ratings deteriorate. The variance of each rating also increases as ratings get worse, except for “HR” loans, where Prosper estimates a long tail of losses and below average returns. Loan rated in the middle (“B”, “C” and “D”) see most outliers above returns of 0.2. This could explain why “C” rated loans are the most common, as they are seen as providing the best balance between risk and return.

Stated Monthly Income vs Prosper Ratings (numeric)

## 
##  Pearson's product-moment correlation
## 
## data:  prosperloan$ProsperRating..numeric. and prosperloan$StatedMonthlyIncome
## t = 27.594, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08763298 0.10097022
## sample estimates:
##        cor 
## 0.09430583

This plot shows that there is little to no correlation between income and loan rating, high-income borrowers can also have poor loan rating and vice-versa.

Estimated Return vs Income Range

Median “Estimated Return” decreases as borrowers’ income increases, contrary to my initial intuition that returns is positively correlated with borrower’s income. It seems there is a basis to the term “sandwich class”. Perhaps this is also due to the fact that higher-income borrowers can turn to traditional sources of credit like banks for better rates.

Estimated Loss vs Employment duration

## 
##  Pearson's product-moment correlation
## 
## data:  prosperloan$EmploymentStatusDuration and prosperloan$EstimatedLoss
## t = -11.415, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04587760 -0.03243981
## sample estimates:
##         cor 
## -0.03916047

There are no obvious trend, except that majority of borrowers have employment status durations of less than 10 years when they apply for the loans. I was expecting a relationship of lower “Estimated Loss” with higher employment duration.

Estimated Returns vs Credit Scores

## 
##  Pearson's product-moment correlation
## 
## data:  prosperloan$CreditScoreAvg and prosperloan$EstimatedReturn
## t = -107.5, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3521408 -0.3402970
## sample estimates:
##        cor 
## -0.3462327

The blue lines bound 95% of the data, while the red line shows the median estimated return. Median returns decrease as credit score of borrows increase. This is mitigated somewhat by the lower occurences of negative returns for higher credit score borrowers compared to the higher frequency for borrowers with credit scores below 700.

Estimated Returns vs AvailableBankcardCredit

## 
##  Pearson's product-moment correlation
## 
## data:  prosperloan$AvailableBankcardCredit and prosperloan$EstimatedReturn
## t = -86.328, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2903205 -0.2779501
## sample estimates:
##        cor 
## -0.2841471

Overall the “Estimated Return” falls within the band of 0.05 and 0.15 across all “AvailableBankcardCredit”, except when “AvailableBankcardCredit” falls below $2,500, where variance of “Estimated Return” starts to increase. There are more loans above and below the common band of “Estimated Return”.

Estimated Returns vs BankcardUtilization

## 
##  Pearson's product-moment correlation
## 
## data:  prosperloan$BankcardUtilization and prosperloan$EstimatedReturn
## t = 47.475, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1542965 0.1674052
## sample estimates:
##       cor 
## 0.1608579

Estimated return is somewhat consistent across all bankcard utilization rates, with a weak correlation.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In general, Prosper manages estimate return as a very tightly-controlled variable. Estimated return is a function of estimated yield and “Estimated Loss” and falls within highly defined range (generally 0.05 to 0.15) with few outliers above and under.

While their strategy was to focus on prime and superprime borrowers, the most common rating for loans are “B”, “C” and “D”, as opposed to the intuitive “A” or “AA”. This is primarily due to better returns provided by mid-rated loans. From the estimated return plots, after accounting for defaults and losses, mid-rated loans generate better returns.

This is probably a result of the nature of P2P lending industry, they are targeting borrowers who are not able to get favourable rates from traditional credit sources such as banks and are looking to consolidate their debts (mostly credit card). While it is their strategy to target prime and superprime borrowers, it is the middle class borrowers that offer the best rate of returns.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

While “Default Rates” are as expected, higher risk for higher returns, the “LGD Rates” are interesting in that they are almost uncorrelated with “Loan Ratings”. Once a borrow defaults, losses amount to over 70% of the principal, regardless of their initial rating.

Some of my initial intuitions on the importance of variables affecting an individual’s financial credibility are wrong, such as “debt-to-income ratio”, “employment duration” and “stated monthly income”. These turned out to have little correlation with their “Estimated Return” and losses.

What was the strongest relationship you found?

The strongest relationships was between “Estimated Loss” and “Borrower Rate” and “Estimated Return” and “Borrower Rate”. However, since “Borrower Rate” is a feature determined by Prosper (hence, will naturally correlates with “Estimated Loss” and return), we only focused on borrower’s features.

The 3 strongest borrower’s features are: - “CreditScoreAvg” (-0.4161) - “AvailableBankCardCredit” (-0.3078) - “BankCardUtilization” (0.2474)

All of which are only moderately correlated with “Estimated Return” (0.2 to 0.5).

Multivariate Plots Section

Estimated Returns vs CreditScoreAvg and Ratings

The 95% interval (blue lines) shows that once “CreditScoreAvg” goes above about 650, most loans generates at least 5% estimated returns (risk adjusted estimates). As we already observed in the bivariate analysis, most common loans are mid-rated loans, and this plot clearly shows the reason why. Highly rate loans “AA” and “A” has clearly lower estimated returns across all CreditScores compared with “B”, “C” and “D”. However, as ratings worsen to “E” and “HR”, we start to see “Estimated Return” falling below the 2.5 percentile, even negative returns.

Majority of the “AA” and “A” loans generates below median “Estimated Return”. Only when CreditScoreAvg of borrowers go over 800, do we see proportionaly more loan generating above median “Estimated Return”.

Estimated Returns vs AvailableBankCardCredit and Ratings

In general, “AvailableBankCardCredit” is rather independent of “Estimated Return”, evident from the relative straight horizontal lines formed by each loan rating across all “AvailableBankCardCredit”. Only when “AvailableBankCardCredit” drops below $5,000, are there noticeably more occurences of negative “Estimated Return”. Even then, these are mainly due to “HR” loans.

Estimated Returns vs AvailableBankCardCredit and CreditScore.bucket

Borrowers with super-prime and prime credit scores forms a band of “Estimated Return” at about 0.05 to about 0.15, though there are few outliers above and below. However, borrowers with “AvailableBankCardCredit of over $40,000” do seem to provide only positive “Estimated Return”.

Above “AvailableBankCardCredit” of $10,000, there are few occurences of Near-prime or Sub-prime borrowers generating negative “Estimated Return”.

Estimated Returns vs BankcardUtilization and Prosper Ratings

Similar to super-prime credit score, “AA” loans form a baseline “Estimated Return” bankcard utilization rates at about 0.05. Different from the plot above (with creditscore instead of Prosper rating), there are loans at both ends of 0.0 and 1.0 spectrum with negative “Estimated Return”. This is likely because borrowers with 0.0 BankcardUtilization may not necessarily mean they have little existing debt (hence a borrower with better credit) but it could also indicate borrowers with poor credit who were unable to get credit extended to them.

There are more borrowers at 1.0 BankcardUtilization rate with negative return as these are more likely borrowers who overstretched their credit limits and hence unable to make payments for their loans.

Estimated Returns vs BankcardUtilization and CreditScore.bucket

At the lower spectrum of Bankcard Utilization rate (< 0.25), we see more of the Super-prime borrowers having lower “Estimated Return” (< 0.03). The main takeaway seems to be that to avoid negative “Estimated Return”, borrower’s BankcardUtilization rate should be between 0.25 and 1.0, restricted to borrowers with prime and superprime creditscores. Near-prime and sub-prime borrowers offer the potential for higher returns but also has a much higher chance of losses.

Estimated Returns vs BorrowerRate and Prosper Ratings

This plot shows that “BorrowerRate” varies linearly with “Estimated Return”, holding rating constant. The gradient for each rating is similar, however, y-intercept is different, resulting in clear delineation by loan ratings. This is true for all ratings except for HR, where we can clearly see 3 different sub-categories of HR loans. Again, the value of “C” and “D” rated loans are evident here. For a given borrower rate between 0.2 and 0.3, “C” and “D” loans outperform the rest of the ratings in terms of “Estimated Return”, and provides more opportunities of higher than average “Estimated Return” of more than 0.15 than other ratings.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

One of the surprising feature was the plot of “Estimated Returns” vs “BorrowerRate” and “Prosper Ratings”. During the second-order analysis, the underlying reason for “HR” loans having 3 separate lines between Estimated Return and Borrower Rate should be investigated. If possible, the identified reason can be used to incorporate “HR” loans into the portfolio of investment to maximise “Estimated Return” for each BorrowerRate.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, a linear model was attempted for Estimated Returns using Prosper Ratings. However, this model only accounted for 40% of the variance in Estimated Returns and it seems that adding CreditScoreAvg, BankcardUtilization and AvailableBankcardCredit did not result in noticeable changes in the R^2 value.


Final Plots and Summary

Plot One

Description One

The distribution of estimated return is bimodal in linear scale as it is, with outliers in the negative return of almost -0.2 and high positive returns of 0.25.

Plot Two - Estimated Return vs Prosper Ratings

Description Two

With the exception of “AA” and “A” rated loans, all ratings see borrowers with a distribution of credit scores across the spectrum from 640 upwards (640 was a Prosper-imposed baseline). Median credit scores decreases with worsening loan ratings, except for “HR” loans which rose slightly.

Plot Three - Estimated Returns vs CreditScoreAvg and Ratings

Description Three

Loans form horizontal bands across the spectrum of Credit Scores within their ratings, indicating that estimated returns is rather uncorrelated with credit score. This is probably due to the function of estimated returns taking into account borrowers with poorer credit scores and charging higher interests. It is also evident in this plot that Prosper design their products to reap highest risk-adjusted returns from mid-rated loans, “C” and “D”.


Reflection

The Prosper dataset contained almost 114,000 loans, each with 81 variables starting from their inception in 2006. The report took the approach of an investor with the objective of getting the best risk-adjusted returns from the different loans. We started by investigating individually, both lender-defined variables and borrowers’ related variables, thereafter explored pairs of variable where there were interesting features from the individual plots. We also investigated the “default rates”, “loss given default” and “expected loss” for each rating before finally combining multiple variables of the interesting features to observe any potential interaction amongst them.

There was not a particular feature of the borrower’s data which correlated strongly with estimated return. With the exception of “AA” and “HR” loans, most loans fell within an estimated return band of 0.05 to 0.15, each forming a clear sub-band, with estimated returns increasing with worsening ratings, barring outliers. This indicated that Prosper designed each rating to clearly account for their respective risk-levels therefore only generated a very specific band of returns (scatterplots forming horizontal colored lines).

Personally, the surprise of this project was the revelation that “borrower’s income”, “employment duration” and “Debt-to-income ratio”, features which I had previously thought to have significant influence on a borrower’s credit rating did not feature strongly.

Finally, it would be interesting to be able to compare historical returns versus estimated returns for each rating. Although this was not provided in the dataset, it could have been computed using various columns of the existing data. However, since the formula for calculating returns, yields and losses were not provided, it was not attempted, lest inaccurate conclusions were drawn from the analyses.